Basic Examples of using Abydos

We start by importing the phonetic & distance modules of Abydos, along with Pandas.


In [1]:
from abydos.phonetic import *
from abydos.distance import *

import pandas as pd

Then we load some data into a DataFrame. In this case, we'll load the US Census surnames data ranked by frequency.


In [2]:
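# Column 0 holds the surname and column 1 its rank; the rank becomes the index.
# keep_default_na=False keeps strings such as 'NA' or 'NULL' as text instead of NaN.
names = pd.read_csv('../tests/corpora/uscensus2000.csv',
                    comment='#', index_col=1, usecols=(0,1), keep_default_na=False)
names.head()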


Out[2]:
rank name
1 SMITH
2 JOHNSON
3 WILLIAMS
4 BROWN
5 JONES

We can create a dictionary that maps each Soundex code to all the surnames sharing that code. Each group is a set of Soundex collisions (a block, in record-linkage terms). Getting the basic Soundex value of a string is as simple as calling soundex() on it.


In [3]:
soundex('WILLIAMSON')


Out[3]:
'W452'
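To see where a code like W452 comes from, here is a minimal sketch of the classic American Soundex rules in plain Python. It is an illustration written for this walkthrough, not Abydos's implementation, which may handle options and edge cases differently; the names SOUNDEX_CODES and soundex_sketch are hypothetical.

```python
# Illustrative sketch of classic American Soundex; NOT Abydos's implementation.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex_sketch(word, length=4):
    """Keep the first letter, then append consonant digits, padded to `length`."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if digit and digit != prev:
            result += digit
        # Vowels reset the previous digit; H and W leave it untouched.
        if ch not in "HW":
            prev = digit
    return (result + "0" * length)[:length]


print(soundex_sketch("WILLIAMSON"))  # W452
print(soundex_sketch("WELLINGTON"))  # W452
```

Under these rules WILLIAMSON becomes W, then 4 (LL), 5 (M), 2 (S), and the trailing 5 (N) is cut off by the four-character limit, which is why so many longer names end up sharing W452.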

Better yet, we can construct a Soundex() object to reuse for encoding multiple names.


In [4]:
sdx = Soundex()
reverse_soundex = {}
for name in names.name:
    encoded = sdx.encode(name)
    if encoded not in reverse_soundex:
        reverse_soundex[encoded] = set()
    reverse_soundex[encoded].add(name)
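The same grouping pattern can be written a little more compactly with collections.defaultdict. The sketch below is only an illustration of that pattern: build_blocks and toy_key are hypothetical names, and toy_key is a deliberately crude stand-in for a real encoder such as sdx.encode.

```python
from collections import defaultdict


def build_blocks(names, encode):
    """Group names by the key that `encode` assigns to each of them."""
    blocks = defaultdict(set)
    for name in names:
        blocks[encode(name)].add(name)
    return dict(blocks)


def toy_key(name):
    """Stand-in encoder for illustration only: first letter plus name length."""
    return f"{name[0]}{len(name)}"


blocks = build_blocks(["SMITH", "SMYTH", "JONES", "BROWN"], toy_key)
print(sorted(blocks["S5"]))  # ['SMITH', 'SMYTH']
```

The reverse_soundex dictionary above has exactly this shape, with Soundex codes as keys and sets of surnames as values.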

With this dictionary, we can retrieve all the names that map to the same Soundex value as, for example, the name Williamson.


In [5]:
reverse_soundex[soundex('WILLIAMSON')]


Out[5]:
{'WALENGA',
 'WALING',
 'WALINSKI',
 'WALLENIUS',
 'WALLENS',
 'WALLENSTEIN',
 'WALLING',
 'WALLINGA',
 'WALLINGER',
 'WALLINGFORD',
 'WALLINGSFORD',
 'WALLINGTON',
 'WALMSLEY',
 'WEHLING',
 'WELENC',
 'WELLENS',
 'WELLENSTEIN',
 'WELLING',
 'WELLINGER',
 'WELLINGHOFF',
 'WELLINGS',
 'WELLINGTON',
 'WELLINS',
 'WELLMAKER',
 'WELLONS',
 'WELMAKER',
 'WELNIAK',
 'WHALING',
 'WHEELING',
 'WHEELINGTON',
 'WIELENGA',
 'WIELINSKI',
 'WILAMOWSKI',
 'WILENS',
 'WILENSKY',
 'WILINSKI',
 'WILLAIMS',
 'WILLAMSON',
 'WILLEMS',
 'WILLEMSE',
 'WILLEMSEN',
 'WILLEMSSEN',
 'WILLENS',
 'WILLIAMS',
 'WILLIAMSBEY',
 'WILLIAMSBROWN',
 'WILLIAMSEN',
 'WILLIAMSJONES',
 'WILLIAMSLEE',
 'WILLIAMSMAE',
 'WILLIAMSON',
 'WILLIAMSSMITH',
 'WILLIAMSTON',
 'WILLIANSON',
 'WILLIMAS',
 'WILLIMSON',
 'WILLING',
 'WILLINGER',
 'WILLINGHAM',
 'WILLINGS',
 'WILLINGTON',
 'WILLINK',
 'WILLINS',
 'WILLLIAMS',
 'WILLMES',
 'WILLMS',
 'WILMES',
 'WILMS',
 'WILMSEN',
 'WILMSMEYER',
 'WOHLENHAUS',
 'WOLANSKI',
 'WOLANSKY',
 'WOLENSKI',
 'WOLENSKY',
 'WOLINSKI',
 'WOLINSKY',
 'WOLLENZIEN',
 'WOLNIAK',
 'WOLNIEWICZ',
 'WOLNIK',
 'WOLYNIEC',
 'WOOLEMS',
 'WOOLINGTON',
 'WOOLLUMS',
 'WOOLUMS'}

We can build up a DataFrame with some interesting information about these names. First, we'll just collect all the names in a column.


In [6]:
df = pd.DataFrame(sorted(reverse_soundex[soundex('WILLIAMSON')]), columns=['name'])
df


Out[6]:
name
0 WALENGA
1 WALING
2 WALINSKI
3 WALLENIUS
4 WALLENS
5 WALLENSTEIN
6 WALLING
7 WALLINGA
8 WALLINGER
9 WALLINGFORD
10 WALLINGSFORD
11 WALLINGTON
12 WALMSLEY
13 WEHLING
14 WELENC
15 WELLENS
16 WELLENSTEIN
17 WELLING
18 WELLINGER
19 WELLINGHOFF
20 WELLINGS
21 WELLINGTON
22 WELLINS
23 WELLMAKER
24 WELLONS
25 WELMAKER
26 WELNIAK
27 WHALING
28 WHEELING
29 WHEELINGTON
... ...
56 WILLING
57 WILLINGER
58 WILLINGHAM
59 WILLINGS
60 WILLINGTON
61 WILLINK
62 WILLINS
63 WILLLIAMS
64 WILLMES
65 WILLMS
66 WILMES
67 WILMS
68 WILMSEN
69 WILMSMEYER
70 WOHLENHAUS
71 WOLANSKI
72 WOLANSKY
73 WOLENSKI
74 WOLENSKY
75 WOLINSKI
76 WOLINSKY
77 WOLLENZIEN
78 WOLNIAK
79 WOLNIEWICZ
80 WOLNIK
81 WOLYNIEC
82 WOOLEMS
83 WOOLINGTON
84 WOOLLUMS
85 WOOLUMS

86 rows × 1 columns

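To that, let's add a few measures of how close each name is to WILLIAMSON: one distance (lower is closer) and two similarities (higher is closer).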


In [7]:
# Levenshtein distance from 'WILLIAMSON'
lev = Levenshtein()
df['Levenshtein'] = df.name.apply(lambda name: lev.dist_abs('WILLIAMSON', name))
# Jaccard similarity on 2-grams
jac = Jaccard()
df['Jaccard'] = df.name.apply(lambda name: jac.sim('WILLIAMSON', name))
# Jaro-Winkler similarity
jw = JaroWinkler()
df['Jaro_Winkler'] = df.name.apply(lambda name: jw.sim('WILLIAMSON', name))
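For intuition about the first two columns, here is a minimal plain-Python sketch of unit-cost Levenshtein distance and of Jaccard similarity on 2-grams. It is not Abydos's code. The Jaccard part assumes that 2-grams are counted as a multiset and that each word is padded with a start symbol '$' and a stop symbol '#'; the function names are hypothetical.

```python
from collections import Counter


def levenshtein_sketch(src, tar):
    """Unit-cost edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(tar) + 1))
    for i, s_ch in enumerate(src, start=1):
        current = [i]
        for j, t_ch in enumerate(tar, start=1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (s_ch != t_ch)))  # substitution
        previous = current
    return previous[-1]


def bigrams(word, start="$", stop="#"):
    """Multiset of 2-grams of the padded word (padding is an assumption)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def jaccard_sketch(src, tar):
    """Multiset Jaccard: shared 2-grams divided by all 2-grams."""
    a, b = bigrams(src), bigrams(tar)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


print(levenshtein_sketch("WILLIAMSON", "WILLINGTON"))        # 3
print(round(jaccard_sketch("WILLIAMSON", "WILLAMSON"), 6))   # 0.75
```

WILLIAMSON and WILLAMSON share 9 of 12 distinct padded 2-grams, giving 0.75, and three substitutions turn WILLIAMSON into WILLINGTON.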

And finally, we'll add a few phonetic encodings.


In [8]:
# Double Metaphone (first code only)
dm = DoubleMetaphone()
df['Double_Metaphone'] = df.name.apply(lambda name: dm.encode(name)[0])
# NYSIIS
nysiis = NYSIIS()
df['NYSIIS'] = df.name.apply(lambda name: nysiis.encode(name))
# Alpha-SIS (first code only)
alphasis = AlphaSIS()
df['Alpha_SIS'] = df.name.apply(lambda name: alphasis.encode(name)[0])

In [9]:
df


Out[9]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
0 WALENGA 8 0.055556 0.465079 ALNK WALANG 45270000000000
1 WALING 7 0.125000 0.605556 ALNK WALANG 45270000000000
2 WALINSKI 6 0.111111 0.755000 ALNSK WALANS 45207000000000
3 WALLENIUS 7 0.105263 0.757619 ALNS WALAN 45200000000000
4 WALLENS 6 0.117647 0.737143 ALNS WALAN 45200000000000
5 WALLENSTEIN 7 0.150000 0.604040 ALNSTN WALANS 45201200000000
6 WALLING 6 0.187500 0.787143 ALNK WALANG 45270000000000
7 WALLINGA 6 0.176471 0.755000 ALNK WALANG 45270000000000
8 WALLINGER 6 0.166667 0.730000 ALNKR WALANG 45274000000000
9 WALLINGFORD 6 0.150000 0.683550 ALNKFRT WALANG 45278410000000
10 WALLINGSFORD 6 0.142857 0.765000 ALNKSFRT WALANG 45270841000000
11 WALLINGTON 4 0.294118 0.734286 ALNKTN WALANG 45271200000000
12 WALMSLEY 7 0.111111 0.672222 ALMSL WALNSL 45305000000000
13 WEHLING 7 0.117647 0.573810 ALNK WALANG 45270000000000
14 WELENC 8 0.058824 0.511111 ALNK WALANC 45270000000000
15 WELLENS 6 0.117647 0.671429 ALNS WALAN 45200000000000
16 WELLENSTEIN 7 0.150000 0.584848 ALNSTN WALANS 45201200000000
17 WELLING 6 0.187500 0.671429 ALNK WALANG 45270000000000
18 WELLINGER 6 0.166667 0.618519 ALNKR WALANG 45274000000000
19 WELLINGHOFF 6 0.150000 0.604040 ALNKF WALANG 45278000000000
20 WELLINGS 5 0.176471 0.672222 ALNKS WALANG 45270000000000
21 WELLINGTON 4 0.294118 0.622222 ALNKTN WALANG 45271200000000
22 WELLINS 5 0.187500 0.737143 ALNS WALAN 45200000000000
23 WELLMAKER 6 0.105263 0.618519 ALMKR WALNAC 45374000000000
24 WELLONS 6 0.187500 0.787143 ALNS WALAN 45200000000000
25 WELMAKER 7 0.052632 0.550000 ALMKR WALNAC 45374000000000
26 WELNIAK 6 0.117647 0.573810 ALNK WALNAC 45270000000000
27 WHALING 7 0.117647 0.671429 ALNK WALANG 45270000000000
28 WHEELING 8 0.111111 0.550000 ALNK WALANG 45270000000000
29 WHEELINGTON 6 0.210526 0.518182 ALNKTN WALANG 45271200000000
... ... ... ... ... ... ... ...
56 WILLING 5 0.357143 0.891429 ALNK WALANG 45270000000000
57 WILLINGER 5 0.312500 0.853333 ALNKR WALANG 45274000000000
58 WILLINGHAM 5 0.375000 0.895000 ALNKM WALANG 45273000000000
59 WILLINGS 4 0.333333 0.886429 ALNKS WALANG 45270000000000
60 WILLINGTON 3 0.466667 0.851429 ALNKTN WALANG 45271200000000
61 WILLINK 5 0.357143 0.891429 ALNK WALANC 45270000000000
62 WILLINS 4 0.357143 0.911429 ALNS WALAN 45200000000000
63 WILLLIAMS 3 0.615385 0.937778 ALLMS WALAN 45530000000000
64 WILLMES 5 0.266667 0.891429 ALMS WALN 45300000000000
65 WILLMS 4 0.384615 0.920000 ALMS WALN 45300000000000
66 WILMES 6 0.200000 0.844444 ALMS WALN 45300000000000
67 WILMS 5 0.307692 0.883333 ALMS WALN 45300000000000
68 WILMSEN 4 0.357143 0.873333 ALMSN WALNSA 45302000000000
69 WILMSMEYER 7 0.222222 0.666667 ALMSMR WALNSN 45303400000000
70 WOHLENHAUS 8 0.047619 0.600000 ALNS WALAN 45200000000000
71 WOLANSKI 6 0.052632 0.641667 ALNSK WALANS 45207000000000
72 WOLANSKY 6 0.052632 0.633333 ALNSK WALANS 45207000000000
73 WOLENSKI 7 0.052632 0.550000 ALNSK WALANS 45207000000000
74 WOLENSKY 7 0.052632 0.558333 ALNSK WALANS 45207000000000
75 WOLINSKI 6 0.111111 0.575000 ALNSK WALANS 45207000000000
76 WOLINSKY 6 0.111111 0.550000 ALNSK WALANS 45207000000000
77 WOLLENZIEN 6 0.157895 0.600000 ALNSN WALANS 45202000000000
78 WOLNIAK 6 0.117647 0.573810 ALNK WALNAC 45270000000000
79 WOLNIEWICZ 7 0.100000 0.516667 ALNTS WALNAC 45270000000000
80 WOLNIK 7 0.058824 0.488889 ALNK WALNAC 45270000000000
81 WOLYNIEC 8 0.052632 0.447222 ALNK WALYNA 45270000000000
82 WOOLEMS 6 0.117647 0.657143 ALMS WALAN 45300000000000
83 WOOLINGTON 5 0.222222 0.533333 ALNKTN WALANG 45271200000000
84 WOOLLUMS 6 0.176471 0.737500 ALMS WALAN 45300000000000
85 WOOLUMS 6 0.117647 0.657143 ALMS WALAN 45300000000000

86 rows × 7 columns

Let's check the row for WILLIAMSON.


In [10]:
df[df.name == 'WILLIAMSON']


Out[10]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
50 WILLIAMSON 0 1.0 1.0 ALMSN WALANS 45302000000000

In addition to colliding under Soundex, 7 names (including WILLIAMSON itself) share its first Double Metaphone encoding, ALMSN.


In [11]:
df[df.Double_Metaphone == 'ALMSN']


Out[11]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
37 WILLAMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
40 WILLEMSEN 3 0.400000 0.895556 ALMSN WALANS 45302000000000
41 WILLEMSSEN 4 0.375000 0.880000 ALMSN WALANS 45302000000000
46 WILLIAMSEN 1 0.692308 0.960000 ALMSN WALANS 45302000000000
50 WILLIAMSON 0 1.000000 1.000000 ALMSN WALANS 45302000000000
55 WILLIMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
68 WILMSEN 4 0.357143 0.873333 ALMSN WALNSA 45302000000000

28 names (again including WILLIAMSON) share its NYSIIS encoding, WALANS.


In [12]:
df[df.NYSIIS == 'WALANS']


Out[12]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
2 WALINSKI 6 0.111111 0.755000 ALNSK WALANS 45207000000000
5 WALLENSTEIN 7 0.150000 0.604040 ALNSTN WALANS 45201200000000
16 WELLENSTEIN 7 0.150000 0.584848 ALNSTN WALANS 45201200000000
31 WIELINSKI 5 0.166667 0.760000 ALNSK WALANS 45207000000000
34 WILENSKY 6 0.176471 0.633333 ALNSK WALANS 45207000000000
35 WILINSKI 5 0.250000 0.795833 ALNSK WALANS 45207000000000
37 WILLAMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
39 WILLEMSE 4 0.333333 0.870000 ALMS WALANS 45300000000000
40 WILLEMSEN 3 0.400000 0.895556 ALMSN WALANS 45302000000000
41 WILLEMSSEN 4 0.375000 0.880000 ALMSN WALANS 45302000000000
44 WILLIAMSBEY 3 0.533333 0.905455 ALMSP WALANS 45309000000000
45 WILLIAMSBROWN 3 0.562500 0.953846 ALMSPRN WALANS 45309420000000
46 WILLIAMSEN 1 0.692308 0.960000 ALMSN WALANS 45302000000000
47 WILLIAMSJONES 3 0.562500 0.953846 ALMSNS WALANS 45306200000000
48 WILLIAMSLEE 3 0.533333 0.905455 ALMSL WALANS 45305000000000
49 WILLIAMSMAE 3 0.533333 0.905455 ALMSM WALANS 45303000000000
50 WILLIAMSON 0 1.000000 1.000000 ALMSN WALANS 45302000000000
51 WILLIAMSSMITH 5 0.470588 0.883077 ALMSM0 WALANS 45303100000000
52 WILLIAMSTON 1 0.769231 0.981818 ALMSTN WALANS 45301200000000
53 WILLIANSON 1 0.692308 0.937778 ALNSN WALANS 45202000000000
55 WILLIMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
71 WOLANSKI 6 0.052632 0.641667 ALNSK WALANS 45207000000000
72 WOLANSKY 6 0.052632 0.633333 ALNSK WALANS 45207000000000
73 WOLENSKI 7 0.052632 0.550000 ALNSK WALANS 45207000000000
74 WOLENSKY 7 0.052632 0.558333 ALNSK WALANS 45207000000000
75 WOLINSKI 6 0.111111 0.575000 ALNSK WALANS 45207000000000
76 WOLINSKY 6 0.111111 0.550000 ALNSK WALANS 45207000000000
77 WOLLENZIEN 6 0.157895 0.600000 ALNSN WALANS 45202000000000

And 7 share its first Alpha-SIS encoding, 45302000000000.


In [13]:
df[df.Alpha_SIS == '45302000000000']


Out[13]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
37 WILLAMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
40 WILLEMSEN 3 0.400000 0.895556 ALMSN WALANS 45302000000000
41 WILLEMSSEN 4 0.375000 0.880000 ALMSN WALANS 45302000000000
46 WILLIAMSEN 1 0.692308 0.960000 ALMSN WALANS 45302000000000
50 WILLIAMSON 0 1.000000 1.000000 ALMSN WALANS 45302000000000
55 WILLIMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
68 WILMSEN 4 0.357143 0.873333 ALMSN WALNSA 45302000000000

6 names, WILLIAMSON among them, match in all four of the phonetic algorithms considered here: Soundex, Double Metaphone, NYSIIS, and Alpha-SIS.


In [14]:
df[(df.Alpha_SIS == '45302000000000') & (df.NYSIIS == 'WALANS') &
   (df.Double_Metaphone == 'ALMSN')]


Out[14]:
name Levenshtein Jaccard Jaro_Winkler Double_Metaphone NYSIIS Alpha_SIS
37 WILLAMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
40 WILLEMSEN 3 0.400000 0.895556 ALMSN WALANS 45302000000000
41 WILLEMSSEN 4 0.375000 0.880000 ALMSN WALANS 45302000000000
46 WILLIAMSEN 1 0.692308 0.960000 ALMSN WALANS 45302000000000
50 WILLIAMSON 0 1.000000 1.000000 ALMSN WALANS 45302000000000
55 WILLIMSON 1 0.750000 0.980000 ALMSN WALANS 45302000000000
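An all-or-nothing filter like the one above hides near misses. As a rough sketch in plain Python (no pandas), we can instead count how many encodings each name shares with WILLIAMSON. The four rows below are copied from the tables above; the variable names are hypothetical.

```python
# Encodings copied from four rows of the tables above.
rows = [
    {"name": "WILLAMSON", "Double_Metaphone": "ALMSN",
     "NYSIIS": "WALANS", "Alpha_SIS": "45302000000000"},
    {"name": "WILLIAMSON", "Double_Metaphone": "ALMSN",
     "NYSIIS": "WALANS", "Alpha_SIS": "45302000000000"},
    {"name": "WILLIANSON", "Double_Metaphone": "ALNSN",
     "NYSIIS": "WALANS", "Alpha_SIS": "45202000000000"},
    {"name": "WILMSEN", "Double_Metaphone": "ALMSN",
     "NYSIIS": "WALNSA", "Alpha_SIS": "45302000000000"},
]
columns = ("Double_Metaphone", "NYSIIS", "Alpha_SIS")

# The row we compare every other row against.
reference = next(row for row in rows if row["name"] == "WILLIAMSON")

# Number of encodings each name shares with the reference row.
agreement = {
    row["name"]: sum(row[col] == reference[col] for col in columns)
    for row in rows
}

# Names agreeing on every encoding, as in the combined filter.
full_matches = [name for name, count in agreement.items()
                if count == len(columns)]

print(agreement)
print(full_matches)  # ['WILLAMSON', 'WILLIAMSON']
```

Here WILMSEN agrees on two of the three encodings and WILLIANSON on one, so a threshold on the count would keep names that the strict filter drops.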